CONSISTENCY OF IMPORTANCE SAMPLING ESTIMATES BASED ON
DEPENDENT SAMPLE SETS AND AN APPLICATION TO MODELS WITH 
FACTORIZING LIKELIHOODS 


INGMAR SCHUSTER 

Natural Language Processing Group, University of Leipzig 

ABSTRACT. In this paper, I prove that Importance Sampling estimates based on dependent
sample sets are consistent under certain conditions. This can be used to reduce variance 
in Bayesian Models with factorizing likelihoods, using sample sets that are much larger 
than the number of likelihood evaluations, a technique dubbed Sample Inflation. I evaluate 
Sample Inflation on a toy Gaussian problem and two Mixture Models. 


1. Introduction 

This paper broadens the scope of the Importance Sampling estimator by proving that, under
rather mild conditions, estimates based on dependent sample sets are still consistent. This
can be used for variance reduction in certain model classes, namely those that exhibit a
factorizing structure in their likelihoods. The paper proceeds as follows. In Section 2,
standard Importance Sampling techniques as well as an iterated Importance Sampling scheme,
Population Monte Carlo, are reviewed. Section 3 first exemplifies which models qualify as
having a factorizing structure and introduces Sample Inflation for these models. Sample
Inflation is a technique to artificially blow up the number of samples gained from few
likelihood evaluations, thus attaining a much larger set of dependent samples. Section 4
reviews related work from the Population Monte Carlo literature. In Section 5, I prove that
Importance Sampling estimates based on dependent samples are consistent, i.e. converge to
the integral we are trying to estimate. Finally, Section 6 evaluates Sample Inflation on a
Gaussian toy problem as well as two Dirichlet Mixture Model estimations. In the conclusion,
I give directions for future work.

2. Importance Sampling 

The Importance Sampling estimator approximates the mean (alternatively: integral, expected
value) $H$ of some function $h$ with respect to some probability density $f$:

$$H = \int f(x) h(x)\,dx = \mathbb{E}_f(h(x))$$

This is achieved by sampling from an auxiliary proposal density $q$. Say we have acquired
a sample set $X$ from $q$. The Importance Sampling estimator is given by

$$\mathfrak{I}(X) = \frac{1}{|X|} \sum_{x \in X} w(x) h(x) \tag{1}$$

where $w(x) = f(x)/q(x)$ is the weight function (Robert & Casella 1999). It can be used
when $f$ is given exactly rather than only up to proportionality and is a probability
density (i.e. is non-negative and integrates to 1). A necessary condition for Importance
Sampling to be unbiased is that $q(x) > 0$ whenever $f(x)h(x) \neq 0$. Its variance is given
by $\mathrm{var}_q(\mathfrak{I}(X)) = \sigma^2/|X|$, where $\sigma^2$ is the variance of
$w(x)h(x)$ under $q$ (see Owen 2013). To ensure finite variance, $q$ has to have heavier
tails than $f$ (Robert & Casella 1999).

However, most times we can only compute $f$ up to proportionality, as the normalizing constant
(also called evidence or marginal likelihood) is unknown. In particular, this is often the case
in Bayesian Inference, where the posterior over random variables is given proportionally by
the product of prior and likelihood terms. Here, the self-normalized Importance Sampling
estimator

$$\mathfrak{I}_n(X) = \frac{1}{w_\Sigma(X)} \sum_{x \in X} w_u(x) h(x) \tag{2}$$

can be used (Robert & Casella 1999), where $w_u(x) = f(x)/q(x)$ is the unnormalized weight
function and $w_\Sigma(X) = \sum_{x \in X} w_u(x)$. A variance estimate is given by
$\sum_{x \in X} (w_u(x)/w_\Sigma(X))^2 (h(x) - \mathfrak{I}_n(X))^2$ (Owen 2013). Both standard
and self-normalized Importance Sampling are consistent as a direct consequence of the strong
law of large numbers (see Geweke 1989).
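The two estimators can be illustrated with a minimal sketch. It is not the implementation used for the experiments in this paper; the target N(1, 1), the proposal N(0, 2^2), the constant 5.0 and all function names are assumptions chosen only for illustration.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of N(mean, std**2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_sampling(samples, f, q, h):
    """Estimator (1): mean of w(x) h(x); f must be a normalized density."""
    return sum(f(x) / q(x) * h(x) for x in samples) / len(samples)


def self_normalized_is(samples, f_unnorm, q, h):
    """Estimator (2): weights are divided by their sum, so any constant
    factor in f_unnorm cancels."""
    weights = [f_unnorm(x) / q(x) for x in samples]
    return sum(w * h(x) for w, x in zip(weights, samples)) / sum(weights)


rng = random.Random(0)
samples = [rng.gauss(0.0, 2.0) for _ in range(20000)]  # draws from q = N(0, 2^2)

f = lambda x: normal_pdf(x, 1.0, 1.0)   # normalized target N(1, 1) (assumed)
f_unnorm = lambda x: 5.0 * f(x)         # same target, constant treated as unknown
q = lambda x: normal_pdf(x, 0.0, 2.0)   # proposal density
h = lambda x: x                         # so that H = E_f[x] = 1

est_plain = importance_sampling(samples, f, q, h)
est_self = self_normalized_is(samples, f_unnorm, q, h)
print(est_plain, est_self)
```

Both estimates should land close to the true value 1; only the self-normalized version tolerates the unknown constant in `f_unnorm`.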


2.1. Population Monte Carlo. I will use the Population Monte Carlo (PMC; Cappé et al. 2004)
paradigm in one of the experiments in the evaluation section. As PMC is not well known in the
Machine Learning community, I will introduce it here in a very concise way. However, the
reader might as well skip this section at first and come back to it before reading Section 4.
See Cappé et al. (2004) for a thorough introduction to PMC and Douc et al. (2007), Marin et al.
(2012) and Iacobucci et al. (2010) for newer developments.

The PMC method is based on the observation that proposal distributions for Importance
Sampling can depend on previous samples without compromising the validity or (asymptotic)
unbiasedness of the estimator (Cappé et al. 2004). PMC works by first generating
a population of importance samples (hence the name) from a set of proposal distributions.
In each new generation of samples, proposal distributions can be built on previous sample
generations. To equalize samples, an Importance Resampling step is introduced whereby
each sample in the population is resampled with replacement with a probability proportional
to its weight (Rubin 1987). A detailed description is given in Algorithm 1. The
essential feature of the algorithm is step (a): for each sample and each generation,
an individual proposal distribution can be used, the only restriction being that it must not
depend on samples from the same generation. In its most naive version (which I will be using),
PMC enables choosing the proposal distributions for a new generation such
that they are centered on samples from previous generations. Generally speaking, the aim
when choosing proposal distributions is to minimize the variance of importance weights,
thus avoiding infinite variance of the estimate.
Algorithm 1 Population Monte Carlo Algorithm

Input: initial proposal densities, unnormalized density f, population size p, sample size m
Output: list of m samples
Initialize S = List()
for t = 1 to T do   {T = m / p generations}
    Initialize P = List()
    Initialize W = List()
    for i = 1 to p do
        (a) select proposal distribution q_{i,t}
        (b) generate x ~ q_{i,t} and append it to P
            append weight f(x) / q_{i,t}(x) to W
    end for
    normalize W to sum to 1
    resample p values from P with replacement with
    probability given by the corresponding value in W
    and append samples to S
end for
return S
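Algorithm 1 can be sketched in a few lines of Python. This is a hedged illustration of the naive PMC variant only: the Gaussian kernels, the parameter names (`init_std`, `kernel_std`) and the toy target N(3, 1) are assumptions, not details taken from the experiments.

```python
import math
import random


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def pmc(f_unnorm, init_mean, init_std, kernel_std, p, T, rng):
    """Naive PMC with Gaussian proposals; returns T * p resampled values."""
    centers, std = [init_mean] * p, init_std
    S = []
    for _ in range(T):
        P, W = [], []
        for i in range(p):
            # (a) proposal q_{i,t} = N(centers[i], std^2); the centers come
            #     from the previous generation only, never the current one
            x = rng.gauss(centers[i], std)      # (b) draw x ~ q_{i,t}
            P.append(x)
            W.append(f_unnorm(x) / normal_pdf(x, centers[i], std))
        # Importance Resampling; random.choices accepts unnormalized weights
        resampled = rng.choices(P, weights=W, k=p)
        S.extend(resampled)
        # next generation: kernels centered on the resampled population
        centers, std = resampled, kernel_std
    return S


rng = random.Random(0)
target = lambda x: math.exp(-0.5 * (x - 3.0) ** 2)  # N(3, 1) up to a constant
S = pmc(target, init_mean=0.0, init_std=5.0, kernel_std=2.0, p=500, T=5, rng=rng)
last_generation_mean = sum(S[-500:]) / 500
print(len(S), last_generation_mean)
```

The kernel standard deviation is deliberately chosen wider than the target's, in line with the remark above that the proposal should keep the variance of the importance weights finite.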


3. Models with factorizing likelihood terms 
Assume our generative model has the following structure.

$$\begin{aligned}
\phi &\sim P(\phi \mid \alpha_\phi) \\
\gamma_j &\sim P(\gamma_j \mid \alpha_\gamma) \quad \forall j \in [1, \dots, K] \\
d_i &\sim P(d_i \mid \phi, \gamma, \alpha_d) \quad \forall i \in [1, \dots, N]
\end{aligned}$$

where $d_i$ is the $i$th data point, there are $N$ data points, each $P$ represents some
parameterized family of distributions and $\alpha_\phi, \alpha_\gamma, \alpha_d$ are fixed model
parameters. Then the posterior over the latent variables $\phi, \gamma$ is given by

$$p(\phi, \gamma \mid d) \propto P(\phi \mid \alpha_\phi) \prod_{j=1}^{K} P(\gamma_j \mid \alpha_\gamma) \prod_{i=1}^{N} P(d_i \mid \phi, \gamma, \alpha_d)$$

Now assume further that the likelihood term for each data point $d_i$ depends on exactly one
$\gamma_j$ ($d_i \not\perp \gamma_j$) and is independent of the other variables in $\gamma$
($d_i \perp \gamma_m$ for $m \neq j$). This induces a partition on the data points and allows
for further factorization of the likelihood term

$$\prod_{i=1}^{N} P(d_i \mid \phi, \gamma, \alpha_d) = \prod_{j=1}^{K} \prod_{d_i \not\perp \gamma_j} P(d_i \mid \phi, \gamma_j, \alpha_d)$$

This model structure renders the individual $\gamma_j$ conditionally independent of each other,

$$\gamma_i \perp \gamma_j \mid \alpha, \phi, d \quad \text{for } i \neq j \tag{3}$$

Two model classes satisfying these assumptions are probabilistic matrix factorization
(discussed in Section 3.2) and Dirichlet Mixture Models (discussed in Section 3.3). First
however, I will present an Importance Sampling method, called Sample Inflation, that is
applicable whenever the assumptions above hold.
Algorithm 2 Importance Sampling for factorizing models

Input: proposal densities q_phi, q_{gamma_1}, ..., q_{gamma_K}, unnormalized density f, sample size m
Output: tuple (S, W) of m samples and weights
Initialize samples list S = List()
Initialize weights list W = List()
while len(S) < m do
    sample phi' according to q_phi
    for j = 1 to K do
        sample gamma_j' according to q_{gamma_j}
    end for
    append (phi', gamma_1', ..., gamma_K') to S
    append f(phi', gamma_1', ..., gamma_K') / (q_phi(phi') * prod_{j=1}^{K} q_{gamma_j}(gamma_j')) to W
end while
W = W / (sum_{w in W} w)   {for self-normalized IS}


3.1. Sample Inflation for Importance Sampling. A straightforward self-normalized
Importance Sampler for models with factorizing likelihoods is given in Algorithm 2. If the
density $f$ is actually given in normalized form, the self-normalization at the end
($W = W / \sum_{w \in W} w$) can be skipped.

Now consider the following modification of Algorithm 2: instead of only generating one
sample $\gamma_j'$ from $q_{\gamma_j}$, generate two samples $\gamma_j^{(1)}, \gamma_j^{(2)}$ and
append both $(\phi', \gamma_1^{(1)}, \dots, \gamma_K^{(1)})$ and
$(\phi', \gamma_1^{(2)}, \dots, \gamma_K^{(2)})$ to the sample list $S$ (and the accompanying
weights to the weight list $W$). Contrary to first intuition, a set of samples generated this
way does not jeopardize consistency; for the corresponding proof see Section 5. The likelihood
term for the second sample costs as much to compute as the likelihood term for the first
sample. As the likelihood term is usually the most expensive part of posterior computation, we
get two dependent samples (because the same $\phi'$ appears in both of them) for the
computational price of two independent samples. However, we can take advantage of the
likelihood structure to get an overall of $2^K$ dependent samples. If we sample $M$ initial
samples instead, we can construct $M^K$ dependent samples for the price of $M$ likelihood
evaluations. This grows very quickly: the growth is polynomial in $M$ and exponential in $K$.

For ease of illustration, consider $M = 2$, $K = 2$. The likelihood term for the first
sample $(\phi', \gamma_1^{(1)}, \gamma_2^{(1)})$ is

$$\prod_{d_i \not\perp \gamma_1} P(d_i \mid \phi', \gamma_1^{(1)}, \alpha_d) \prod_{d_i \not\perp \gamma_2} P(d_i \mid \phi', \gamma_2^{(1)}, \alpha_d)$$

and for $(\phi', \gamma_1^{(2)}, \gamma_2^{(2)})$ we have the likelihood

$$\prod_{d_i \not\perp \gamma_1} P(d_i \mid \phi', \gamma_1^{(2)}, \alpha_d) \prod_{d_i \not\perp \gamma_2} P(d_i \mid \phi', \gamma_2^{(2)}, \alpha_d)$$

Reusing the factors computed for the first two samples, we can calculate the likelihoods of
two more dependent samples, $(\phi', \gamma_1^{(1)}, \gamma_2^{(2)})$ and
$(\phi', \gamma_1^{(2)}, \gamma_2^{(1)})$, almost for free!
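The reuse of likelihood factors for $M = 2$, $K = 2$ can be made concrete with a small sketch. The Gaussian likelihood, the data values and the helper name `group_loglik` are hypothetical and serve only to count factor evaluations.

```python
import itertools
import math

evaluations = {"count": 0}


def group_loglik(points, phi, gamma_j):
    """Log-likelihood factor of the data points that depend on gamma_j.
    Toy assumption: d_i ~ N(phi + gamma_j, 1)."""
    evaluations["count"] += 1
    return sum(-0.5 * (d - phi - gamma_j) ** 2 - 0.5 * math.log(2.0 * math.pi)
               for d in points)


data = [[0.1, -0.3, 0.2], [2.1, 1.8]]  # partition of the data induced by gamma_1, gamma_2
phi = 0.5                               # a single proposal phi'
gammas = [[-0.4, 0.1], [1.2, 1.9]]      # gammas[j][i]: i-th of M = 2 proposals for gamma_j

# M * K = 4 factor evaluations: the cost of M = 2 full likelihood evaluations.
factors = [[group_loglik(data[j], phi, g) for g in gammas[j]] for j in range(2)]

# All M**K = 4 joint log-likelihoods are sums of already cached factors.
joint = {idx: sum(factors[j][i] for j, i in enumerate(idx))
         for idx in itertools.product(range(2), repeat=2)}
print(evaluations["count"], len(joint))
```

The two mixed index pairs `(0, 1)` and `(1, 0)` correspond to the two additional dependent samples; they require no further calls to `group_loglik`.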

This gives rise to the Sample Inflation method, given in Algorithm 3. In the algorithm, I
use $c$ as a shorthand ranging over joint samples for the random variables
$\gamma_1, \dots, \gamma_K$ and $q_\gamma(c)$ as a shorthand for the product
$q_{\gamma_1}(c_1) \cdots q_{\gamma_K}(c_K)$. A way to think about Sample Inflation is
that we can use the structure of the problem to get a better approximation of the marginal
$f(\phi)$ by averaging over an inflated sample set for $\gamma$.

Algorithm 3 Importance Sampling (Sample Inflation)

Input: proposal densities q_phi, q_{gamma_1}, ..., q_{gamma_K}, unnormalized density f,
       number m of independent proposals for phi,
       number M of likelihood evaluations per independent proposal of phi
Output: tuple (S, W) of m * M^K samples and weights
Initialize S = List()
Initialize W = List()
while len(S) < m * M^K do
    sample phi' according to q_phi
    for j = 1 to K do
        for i = 1 to M do
            sample gamma_j^(i) according to q_{gamma_j}
        end for
    end for
    compute set C of all M^K possible joint samples from the set of tuples
        {(gamma_1^(i_1), ..., gamma_K^(i_K)) : i_1, ..., i_K in {1, ..., M}}
    for c in C do
        append (phi', c) to S
        append f(phi', c) / (q_phi(phi') * q_gamma(c)) to W   {reuse previous likelihood factor computations for f}
    end for
end while
W = W / (sum_{w in W} w)
return (S, W)
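A compact Python sketch of Sample Inflation is given below, under stated assumptions: all proposals are N(0, `prop_std`^2), the unnormalized density splits as `log_f_phi(phi)` plus one `log_factor(j, phi, gamma_j)` per component, and the toy target (with known mean `mu[0]` for the first component) is invented purely for checking. None of these names or choices come from the paper's experiments.

```python
import itertools
import math
import random


def normal_logpdf(x, mean, std):
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def sample_inflation(log_f_phi, log_factor, K, m, M, rng, prop_std=2.0):
    """Returns (S, W): m * M**K joint samples and self-normalized weights."""
    S, logw = [], []
    for _ in range(m):
        phi = rng.gauss(0.0, prop_std)
        base = log_f_phi(phi) - normal_logpdf(phi, 0.0, prop_std)
        gam = [[rng.gauss(0.0, prop_std) for _ in range(M)] for _ in range(K)]
        # One cached term per (j, i): log factor minus log proposal density.
        # These M * K terms are the only factor evaluations for this phi.
        term = [[log_factor(j, phi, g) - normal_logpdf(g, 0.0, prop_std)
                 for g in gam[j]] for j in range(K)]
        # Recombine the cached terms into all M**K joint samples.
        for idx in itertools.product(range(M), repeat=K):
            S.append((phi, tuple(gam[j][i] for j, i in enumerate(idx))))
            logw.append(base + sum(term[j][i] for j, i in enumerate(idx)))
    top = max(logw)                      # subtract the maximum for stability
    w = [math.exp(l - top) for l in logw]
    total = sum(w)
    return S, [x / total for x in w]


# Toy target: phi ~ N(0, 1) and gamma_j | phi ~ N(phi + mu[j], 1),
# so that E[gamma_1] = mu[0] = 1 is known exactly.
mu = [1.0, -1.0]
log_f_phi = lambda phi: -0.5 * phi * phi
log_factor = lambda j, phi, g: -0.5 * (g - phi - mu[j]) ** 2

rng = random.Random(1)
S, W = sample_inflation(log_f_phi, log_factor, K=2, m=2000, M=5, rng=rng)
est_gamma1 = sum(w * s[1][0] for s, w in zip(S, W))
print(len(S), est_gamma1)
```

Working with log weights and subtracting the maximum before exponentiating is a design choice of this sketch; it avoids underflow when the unnormalized density is very small.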


3.2. Matrix Factorization. For illustration purposes I will discuss Factor Analysis. Other
examples of Bayesian matrix factorization models include Gamma Process Nonnegative
Matrix Factorization (Hoffman et al. 2010), Probabilistic Matrix Factorization (Salakhutdinov
& Mnih 2007) and Poisson Factorization (Gopalan et al. 2013). The Factor Analysis
model with $k$ latent factors has the structure

$$d_i = \phi \gamma_i + \epsilon_i$$

Here $d_i \in \mathbb{R}^p$, $\epsilon_i \sim N(0, \Sigma)$ is a residual for some covariance
matrix $\Sigma$, $\phi \in \mathbb{R}^{p \times k}$ is a factor loading matrix and
$\gamma_i \sim N(0, I_k)$ is a vector of latent factors (one for each data point, thus $K$
equals the number of data points). I will not discuss the choice of priors on $\phi$ and
$\Sigma$; for a profound discussion of Factor Analysis, see Dunson (2006). The key
observation is that the likelihood of $d_i$ does not depend on $\gamma_j$ for $j \neq i$ and
thus each $\gamma_j$ is conditionally independent of all the other variables in $\gamma$:
$\gamma_i \perp \gamma_j \mid \alpha, \phi, d$ for $i \neq j$. Thus, the assumptions from
Section 3 are satisfied. Sample Inflation in the case of Factor Analysis works by first
sampling proposals $\phi'$ and possibly $\Sigma'$, then sampling $M$ proposals for each
$\gamma_i$. The likelihood of a single sample $\gamma_i^{(1)}$ then is evaluated as
$N(d_i \mid \phi' \gamma_i^{(1)}, \Sigma')$. Let's say we have two data points $d_1, d_2$ and
two samples for each of the $\gamma_i$. The likelihood of the two joint samples
$(\phi', \Sigma', \gamma_1^{(1)}, \gamma_2^{(1)})$, $(\phi', \Sigma', \gamma_1^{(2)}, \gamma_2^{(2)})$ is

$$N(d_1 \mid \phi' \gamma_1^{(1)}, \Sigma')\, N(d_2 \mid \phi' \gamma_2^{(1)}, \Sigma')$$

and

$$N(d_1 \mid \phi' \gamma_1^{(2)}, \Sigma')\, N(d_2 \mid \phi' \gamma_2^{(2)}, \Sigma')$$

From the factors computed for these two samples, we get the likelihood for
$(\phi', \Sigma', \gamma_1^{(1)}, \gamma_2^{(2)})$ and $(\phi', \Sigma', \gamma_1^{(2)}, \gamma_2^{(1)})$
using almost no additional computation time. In general we get $M^K$ samples using $M$
likelihood evaluations.

3.3. Dirichlet Mixture Models. In Dirichlet Mixture Models each data point is assumed
to be generated by a mixture of $K$ base distributions, where parameters of the base
distributions are given by $\gamma_1, \dots, \gamma_K$. A Dirichlet prior is placed on the
mixture proportions $\phi^{(1)}$. For each data point $d_i$ a categorical variable
$\phi_i^{(2)}$ is drawn, indicating which base distribution it is generated from.¹ The full
generative model is

$$\begin{aligned}
\gamma_j &\sim G_0(\alpha_\gamma) \quad \forall j \in [1, \dots, K] \\
\phi^{(1)} &\sim \mathrm{Dir}(\alpha_\phi) \\
\phi_i^{(2)} &\sim \mathrm{Cat}(\phi^{(1)}) \\
d_i &\sim P(d_i \mid \gamma_{\phi_i^{(2)}})
\end{aligned}$$

where $G_0$ is a prior on the parameters of the $K$ base distributions,
$\phi^{(1)} \in \mathbb{R}_+^K$, $\phi_i^{(2)} \in \{1, \dots, K\}$ and
$P(\cdot \mid \gamma_{\phi_i^{(2)}})$ is the base distribution with index $\phi_i^{(2)}$ (each
base distribution could also have some global parameter $\alpha_d$, which I drop for
notational clarity). Again, observe that $d_i$ does not depend on $\gamma_j$ for
$j \neq \phi_i^{(2)}$ and the assumptions from Section 3 hold because
$\gamma_i \perp \gamma_j \mid \alpha_\gamma, \alpha_\phi, \phi^{(1)}, \phi^{(2)}, d$ for $i \neq j$.
To apply Sample Inflation to Dirichlet Mixture Models, one would first sample a proposal
$\phi^{(1)\prime}$ and $\phi_i^{(2)\prime}$ for each $i$, then $M$ proposals for each
$\gamma_j$, and recombine these to get $M^K$ dependent samples.


4. Related Work 


To the best of my knowledge, a recombination of Importance Samples as suggested in 
this paper has not been proposed before. 

Generally speaking, variance reduction is an important topic in Importance Sampling and
its descendant Population Monte Carlo. I will concentrate on the PMC case here. In the
original paper by Cappé et al. (2004), the approach used for variance reduction is to keep
several Markov transition kernels which generate new samples centered on previous ones
with a different variance for each kernel. Those kernels which exhibit smaller weight
variance are then used more often. Mixture-PMC (M-PMC; Cappé et al. 2008) goes one
step further in that it fits a Gaussian or multivariate t mixture model to the samples from
previous generations, generating new samples from this approximation of the posterior.
D-Kernel PMC by Douc et al. (2007) fits a D-Kernel Mixture and can be shown to converge
to the optimum D-Kernel Mixture.


¹ The notation differs from the usual notation in DP Mixture Models. However, I valued
consistency with Section 3 higher than consistency with the rest of the literature.
5. Importance Sampling estimators based on dependent samples


In the literature, the sample set used for the Importance Sampling estimator is often
assumed to consist only of independent and identically distributed (iid) samples. However, one
potentially interesting (and as we will see practically very relevant) case is when samples
are guaranteed to come from the proposal density $q$ but are not required to be independent.
I will first introduce some assumptions and notation for this section.


Definition 1. Let $X_1, \dots, X_k$ with fixed $k$ be (multi-)sets of samples from some density
(for claims about the Importance Sampling estimator, from the proposal density $q$). The
samples in each $X_i$ are assumed to be iid but the samples in the (multi-)set
$X_\cup = \bigcup_i X_i$ are not necessarily independent. Furthermore, let
$X_\cup^{(m)} = \bigcup_{i=1}^k X_i^{(m)}$ be a sequence of sample sets indexed by $m$, with
fixed $k$, such that $|X_i^{(m)}| \to \infty$ for every $i$ (and hence
$|X_\cup^{(m)}| \to \infty$) as $m \to \infty$. The samples in each $X_i^{(m)}$ are assumed to
be iid for any $m$ and $i$, but the samples in $X_\cup^{(m)}$ might be dependent.


Now as a first step towards proving consistency of Importance Sampling estimates based
on dependent sample sets, we note that the normed error of any convex combination of
estimates based on iid sample sets cannot increase compared to the same convex combination
of normed errors of individual estimates.


Theorem 1. Let $\mathfrak{I}$ be any estimator of the true quantity $H$. Then the normed error
of a convex combination of estimates $\sum_i \lambda_i \mathfrak{I}(X_i)$ cannot exceed the
convex combination of normed errors:

$$\sum_{i=1}^{k} \lambda_i \|\mathfrak{I}(X_i) - H\| \;\geq\; \Big\| \sum_{i=1}^{k} \lambda_i \mathfrak{I}(X_i) - H \Big\| \;\geq\; 0$$

for any norm $\|\cdot\|$ and $\sum_i \lambda_i = 1$, $\forall i: \lambda_i \geq 0$. In particular,
this implies the squared error of the convex combination of estimators cannot exceed the
convex combination of squared errors.


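Proof. We have

$$\begin{aligned}
\sum_{i=1}^{k} \lambda_i \|\mathfrak{I}(X_i) - H\|
  &\geq \Big\| \sum_{i=1}^{k} \lambda_i (\mathfrak{I}(X_i) - H) \Big\| \\
  &= \Big\| \Big( \sum_{i=1}^{k} \lambda_i \mathfrak{I}(X_i) \Big) - H \Big\| \\
  &\geq 0
\end{aligned}$$

where the first inequality follows either from subadditivity and absolute homogeneity of norms
or from Jensen's inequality and the fact that norms are convex. The second inequality
follows from the positivity property of norms. □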
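Spelled out, the first inequality combines the triangle inequality with absolute homogeneity, and the remark on squared errors in Theorem 1 needs one additional application of Jensen's inequality to the convex map $t \mapsto t^2$. The following sketch writes $e_i$ for the error of the $i$-th estimate; this symbol is introduced here only for brevity and is not part of the original notation.

```latex
\begin{align*}
\Big\| \sum_{i=1}^k \lambda_i e_i \Big\|
  &\le \sum_{i=1}^k \| \lambda_i e_i \|
   = \sum_{i=1}^k \lambda_i \| e_i \|, \\
\Big\| \sum_{i=1}^k \lambda_i e_i \Big\|^2
  &\le \Big( \sum_{i=1}^k \lambda_i \| e_i \| \Big)^2
   \le \sum_{i=1}^k \lambda_i \| e_i \|^2 .
\end{align*}
```

Both lines use $\lambda_i \geq 0$; the identity $\sum_i \lambda_i H = H$, which allows pulling $H$ out of the sum, uses $\sum_i \lambda_i = 1$.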

Now I will specialize Theorem 1 to the case of the (normalized) Importance Sampling
estimator. Recall that we are trying to estimate the integral $H = \int f(x) h(x)\,dx$.
Theorem 2. The normed error of the estimate $\mathfrak{I}(X_\cup)$ cannot exceed the
cardinality weighted average of normed errors:

$$\sum_{i=1}^{k} \frac{|X_i|}{|X_\cup|} \|\mathfrak{I}(X_i) - H\| \;\geq\; \|\mathfrak{I}(X_\cup) - H\| \;\geq\; 0$$

where $|\cdot|$ signifies the cardinality of a set.

Furthermore, the normed error of the estimate $\mathfrak{I}_n(X_\cup)$ cannot exceed the
importance weighted average of normed errors:

$$\sum_{i=1}^{k} \frac{w_\Sigma(X_i)}{w_\Sigma(X_\cup)} \|\mathfrak{I}_n(X_i) - H\| \;\geq\; \|\mathfrak{I}_n(X_\cup) - H\| \;\geq\; 0 .$$

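Proof. For the case of the unnormalized estimator $\mathfrak{I}$, if we choose
$\lambda_i = |X_i| / |X_\cup|$ and show
$\sum_{i=1}^{k} \frac{|X_i|}{|X_\cup|} \mathfrak{I}(X_i) = \mathfrak{I}(X_\cup)$, the claim
follows from Theorem 1. Using the definition of the estimator $\mathfrak{I}$ in (1) we have

$$\sum_{i=1}^{k} \frac{|X_i|}{|X_\cup|} \mathfrak{I}(X_i)
  = \sum_{i=1}^{k} \frac{|X_i|}{|X_\cup|} \frac{1}{|X_i|} \sum_{x \in X_i} w(x) h(x)
  = \frac{1}{|X_\cup|} \sum_{x \in X_\cup} w(x) h(x)
  = \mathfrak{I}(X_\cup)$$

and thus the first claim holds. For the case of the self-normalized estimator
$\mathfrak{I}_n$, if we choose $\lambda_i = w_\Sigma(X_i) / w_\Sigma(X_\cup)$ and show
$\sum_{i=1}^{k} \frac{w_\Sigma(X_i)}{w_\Sigma(X_\cup)} \mathfrak{I}_n(X_i) = \mathfrak{I}_n(X_\cup)$,
the claim follows from Theorem 1. Using the definition of the estimator $\mathfrak{I}_n$ in (2)
we have

$$\sum_{i=1}^{k} \frac{w_\Sigma(X_i)}{w_\Sigma(X_\cup)} \mathfrak{I}_n(X_i)
  = \sum_{i=1}^{k} \frac{w_\Sigma(X_i)}{w_\Sigma(X_\cup)} \frac{1}{w_\Sigma(X_i)} \sum_{x \in X_i} w_u(x) h(x)
  = \frac{1}{w_\Sigma(X_\cup)} \sum_{x \in X_\cup} w_u(x) h(x)
  = \mathfrak{I}_n(X_\cup)$$

and thus the second claim holds. □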
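The identity behind the second claim is purely algebraic and can be checked numerically. The sketch below uses three independently drawn sets of different sizes (the identity does not depend on how the sets were generated); the target, proposal and all names are assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


f_unnorm = lambda x: 3.0 * normal_pdf(x, 1.0, 1.0)  # target N(1, 1) times a constant
q = lambda x: normal_pdf(x, 0.0, 2.0)               # proposal N(0, 2^2)
h = lambda x: x
H = 1.0                                             # true value of E_f[h]


def weight_sum(X):
    """Sum of unnormalized weights over a sample set."""
    return sum(f_unnorm(x) / q(x) for x in X)


def snis(X):
    """Self-normalized Importance Sampling estimate from sample set X."""
    return sum(f_unnorm(x) / q(x) * h(x) for x in X) / weight_sum(X)


rng = random.Random(0)
sets = [[rng.gauss(0.0, 2.0) for _ in range(n)] for n in (50, 200, 1000)]
pooled = [x for X in sets for x in X]               # multiset union of all sets

# importance-weighted convex combination of the per-set estimates
lam = [weight_sum(X) / weight_sum(pooled) for X in sets]
combination = sum(l * snis(X) for l, X in zip(lam, sets))
weighted_error = sum(l * abs(snis(X) - H) for l, X in zip(lam, sets))
pooled_error = abs(snis(pooled) - H)
print(combination, snis(pooled), weighted_error, pooled_error)
```

The combination coincides with the pooled estimate up to floating point error, and the pooled error does not exceed the weighted average of the individual errors.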


To get an intuition for the meaning of Theorem 2 for the case of the unnormalized
estimator, recall that by using Algorithm 3 we can get $M^K$ iid sample sets. Each of these
is of size $m$, so the convex combination amounts to a simple average. Thus, we can only
do better on average by using the samples from all sets as compared to the samples from
only one set. This seems particularly fortunate after realizing that there is no reason to
prefer one of the sample sets over one of the others (all of them are sampled iid from $q$).

Now if the normed error cannot increase when using dependent sample sets for estimation,
we might expect that the estimate converges in probability to the true integral $H$. In
other words, we might expect that the sequence of estimates is consistent. This is indeed
the case as stated by the following theorem.

Theorem 3. Let $\mathfrak{I}$ be the (self-normalized) Importance Sampling estimator. The
sequence of estimates $\mathfrak{I}(X_\cup^{(m)})$ for $m \to \infty$ is consistent, i.e.

$$\lim_{m \to \infty} P(\|\mathfrak{I}(X_\cup^{(m)}) - H\| > \epsilon) = 0$$

for all $\epsilon > 0$.

An important detail of Theorem 3 is that the number of sets $k$ is fixed as $m \to \infty$.

One of the major reasons for choosing Importance Sampling over other simulation techniques
is that it enables approximating model evidence (also called marginal likelihood or
the normalizing constant of $f$). This is based on the identity

$$Z = \int f(x)\,dx = \int q(x) \frac{f(x)}{q(x)}\,dx = \mathbb{E}_q\!\left( \frac{f(x)}{q(x)} \right)$$

which yields the unbiased and consistent estimator

$$\mathfrak{I}(X) = \frac{1}{|X|} \sum_{x \in X} f(x)/q(x) \tag{4}$$

Evidence estimates based on sample sets that contain dependent samples will stay consistent,
as follows from Theorem 3 by setting $h(x) = 1$.
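When the evidence is very small, the average of weights is best computed in log space. The following sketch assumes a one-dimensional Gaussian target whose log evidence is set to -1000 and a wider Gaussian proposal; the helper `log_mean_exp` and all other names are illustrative assumptions.

```python
import math
import random


def normal_logpdf(x, mean, std):
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def log_mean_exp(values):
    """Log of the arithmetic mean of exp(values), computed stably."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))


log_Z_true = -1000.0
log_f = lambda x: log_Z_true + normal_logpdf(x, 0.0, 1.0)  # unnormalized target
log_q = lambda x: normal_logpdf(x, 0.0, 2.0)               # proposal N(0, 2^2)

rng = random.Random(0)
X = [rng.gauss(0.0, 2.0) for _ in range(20000)]

# log of the evidence estimator: the mean of f(x)/q(x) over the sample set
log_Z_hat = log_mean_exp([log_f(x) - log_q(x) for x in X])
print(log_Z_hat)
```

Exponentiating the individual weights directly would underflow to zero here, which is why the maximum log weight is subtracted first.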

6. Evaluation 

In this section, I will evaluate Sample Inflation for two cases. As a very simple measure,
we will look into the performance of Sample Inflation when computing the (known)
expectation of a two-dimensional Gaussian distribution with diagonal covariance matrix.
As a more involved case we will consider the estimation of two Dirichlet Mixture Models.

6.1. Expectation of a multivariate Gaussian. For the two experiments in this subsection,
20,000 samples were drawn from the respective two-dimensional proposal distribution.
These were unchanged for standard Importance Sampling estimation. For Sample Inflation,
the sample set was partitioned into sets of 100 samples, which were inflated and
concatenated into 2,000,000 dependent samples.
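The sample counts are consistent with treating the two coordinates of each sample as the components that get recombined, i.e. $K = 2$ with $M = 100$ proposals per coordinate in every set. This reading is inferred from the counts rather than stated explicitly; under that assumption the bookkeeping is:

```latex
\[
\underbrace{\frac{20{,}000}{100}}_{200 \text{ sets}} \times
\underbrace{100^{2}}_{M^{K} \text{ per set}}
= 200 \times 10{,}000 = 2{,}000{,}000 .
\]
```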

As a first evaluation case I chose a multivariate normal, $f = N(0, 2I)$, as the target
distribution. The log evidence (log normalizing constant) was artificially set to $-1000$. The
proposal distribution was a multivariate t-distribution with the same mean and covariance
matrix and 20 degrees of freedom, $q = T(0, 2I, 20)$. Squared bias, mean squared error
(MSE) and variance of the estimates for the expectation of the target distribution as well as
the evidence are given in log-log plots in Figure 1. For estimation of the target's
expectation, the MSE, which subsumes variance and squared bias, clearly shows that using
Sample Inflation is preferable to standard Importance Sampling. The picture is less clear cut
for evidence approximation, but Sample Inflation does not seem to hurt performance strongly.


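Proof of Theorem 3. By standard results, for all $i$ and any $\epsilon > 0$:

$$\lim_{|X_i^{(m)}| \to \infty} P(\|\mathfrak{I}(X_i^{(m)}) - H\| > \epsilon) = 0$$

Now assume that $|X_i^{(m)}| \to \infty$ for all $i$ as $m \to \infty$. This implies that for
any convex combination

$$\lim_{m \to \infty} P\Big( \sum_{i=1}^{k} \lambda_i^{(m)} \|\mathfrak{I}(X_i^{(m)}) - H\| > \epsilon \Big) = 0$$

using $\sum_{i=1}^{k} \lambda_i^{(m)} \epsilon = \epsilon$. Now if $\mathfrak{I}$ is the standard
estimator (1), choose $\lambda_i^{(m)} = |X_i^{(m)}| / |X_\cup^{(m)}|$; if
$\mathfrak{I} = \mathfrak{I}_n$, choose
$\lambda_i^{(m)} = w_\Sigma(X_i^{(m)}) / w_\Sigma(X_\cup^{(m)})$, and apply Theorem 2 to get

$$\lim_{m \to \infty} P(\|\mathfrak{I}(X_\cup^{(m)}) - H\| > \epsilon) = 0 .$$

□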
Figure 1. Performance of Sample Inflation as compared to standard
self-normalized Importance Sampling from the same proposal distribution.
The top plots give squared bias, mean squared error and variance
for estimation of the expectation of the target distribution. The bottom
plots give the same measures for estimation of the evidence. The proposal
distribution was centered on the true expectation of the target distribution.
(Horizontal axis in all panels: log number of likelihood evaluations.)


The second experiment used the same target, $f = N(0, 2I)$, with a log evidence of
$-1000$. This time however, the proposal distribution was not centered on the target, but on
$(5, 5)^T$, $q = T((5, 5)^T, 2I, 20)$. The mean squared error evaluation does not favor Sample
Inflation for estimation of the target's expectation this time, though Sample Inflation gives
more stable estimates (Figure 2). The major contribution to MSE here comes from the bias,
which is caused by the fact that our proposal distribution is not centered on the target. For
evidence approximation, Sample Inflation hurts performance slightly, but bear in mind that
the differences to standard Importance Sampling are small when transformed back from
log space.

6.2. Estimation of DMMs. In this evaluation, I use a Population Monte Carlo approach to
estimate two Dirichlet Mixture Models (DMMs) for synthetic data sets comprised of 100
data points. For both experiments, 2000 samples were drawn for standard Importance
Sampling. For Sample Inflation, after sampling $\phi^{(1)\prime}$ and $\phi^{(2)\prime}$, two
dependent samples were drawn for the parameters of the two component distributions (thus
$M = 2$, $K = 2$). I used fewer overall samples for Sample Inflation than for standard
Importance Sampling, so as to keep the number of likelihood evaluations exactly equal.
In the first case, the synthetic data was generated from a mixture of two one-dimensional
Gaussians with different means and unit variance. The DMM used two Gaussian components
with fixed unit variance. Thus, only the means of the components had to be estimated.
I put an $N(0, 1)$ prior on the component means. I used Gaussian Markov kernels to generate
proposals based on samples from earlier generations of the PMC algorithm. Sample
Inflation attained regions of high likelihood more quickly and exhibited lower variance
than standard Importance Sampling (Figure 3).

Figure 2. Performance of Sample Inflation as compared to standard
self-normalized Importance Sampling from the same proposal distribution.
The top plots give squared bias, mean squared error and variance
for estimation of the expectation of the target distribution. The bottom
plots give the same measures for estimation of the evidence. The proposal
distribution was not centered on the true expectation of the target
distribution. (Horizontal axis in all panels: log number of likelihood evaluations.)





Figure 3. Gaussian Mixture Model estimation. Solid lines mark Sample
Inflation, dashed lines standard Importance Sampling. The true
means of the synthetic data are dotted. Using Sample Inflation, high
likelihood regions are reached more quickly. This is reflected in the
estimated means, which are closer to the true means of the data for Sample
Inflation.
In the second case, the synthetic data was generated from a mixture of two one-dimensional
t-distributions with different means, unit variance, and 30 degrees of freedom.
The DMM used two t-distributed components. Based on a sample $S$ from a previous generation, I
used t-distributed Markov kernels to generate proposals for the mean, centered on the value
of the mean parameter in $S$. Analogously, I used Inverse-Wishart distributions centered
on the covariance matrix in $S$ and Gamma distributions centered on the degrees of freedom
in $S$. A Student-t $T(0, 1, 1)$ prior was placed on the component means, an Inverse-Wishart
$IW(\sigma^2 = 5, df = 1)$ prior on the covariance and a $\mathrm{Gamma}(1, 1)$ prior (shape and
scale parametrization) on the degrees of freedom. Here, Sample Inflation is much better
at achieving high likelihoods more quickly, though the estimates exhibit higher variance
(Figure 4). The estimates of the means are not close to the true means of the synthetic data,
which probably stems from the fact that the Mixture Model is very flexible, as we also
estimate the covariance matrix and degrees of freedom. Also, the $IW(\sigma^2 = 5, df = 1)$ prior
on the covariance and the $\mathrm{Gamma}(1, 1)$ prior on degrees of freedom are very broad and
compensate easily for the rather narrow $T(0, 1, 1)$ prior on the component means.



Figure 4. T Mixture Model estimation. Solid lines mark Sample Inflation,
dashed lines standard Importance Sampling. The true means of
the synthetic data are dotted. Using Sample Inflation, high likelihood
regions are reached much more quickly. (Horizontal axis: number of likelihood evaluations.)


7. Conclusion 

The contributions of this paper were twofold. First, I proved that Importance Sampling
estimates based on dependent sample sets are consistent under mild conditions. To the best
of my knowledge, this has not been proved before or, if it has, the mainstream literature
does not reflect this. Second, I applied this to models with factorizing likelihoods, resulting
in Sample Inflation, a technique to generate many dependent samples from few likelihood
evaluations. The evaluation in Section 6 showed that Sample Inflation can reduce variance
and help to attain high likelihood regions more quickly in a Population Monte Carlo setting.
Future work will have to derive variance estimates for Sample Inflation and, as a
consequence, measures of Effective Sample Size and perplexity (Robert & Casella 2010).
This will hopefully lead to a better understanding of when Sample Inflation can help and
under which conditions it hurts performance.